Software Tools for Earth and Environmental Science

– 12th WEEK –

Emir Toker

06/12/2019

R Probability

  • Syllabus and Book
  • DataCamp Class
  • R - Elemantary Statistics
  • Basic Data Visualization
  • R - Probability

Coffee Break

  • R - Common Probability Distributions
  • Practice : Write A Function

QUIZ

Next Week

Syllabus and Book

Syllabus

Book

DataCamp Class

DataCamp Class

LINK

R - Elemantary Statistics

R - Elemantary Statistics

R - Elemantary Statistics

Centrality: Mean, Median, Mode

  • Measures of centrality are commonly used to explain large collections of data by describing where numeric observations are centered.

R - Elemantary Statistics

Centrality: Mean, Median, Mode

  • Median : “middle magnitude” of your observations

0 . 0 . 0 . 0 . 0

o . o . o . o . o . o

R - Elemantary Statistics

Centrality: Mean, Median, Mode

  • Mode : Simply the “most common” observation.

Sample : 2 , 4.4 , 3 , 3 , 2 , 2.2 , 2 , 4

2 , 2 , 2 , 2.2 , 3 , 3 , 4, 4.4 ( n=8 , n/2 = 4)

R - Elemantary Statistics

Centrality: Mean, Median, Mode

xdata <- c(2,4.4,3,3,2,2.2,2,4)

  • mean(xdata)
  • median(xdata)
  • min(xdata)
  • max(xdata)
  • range(xdata)

R - Elemantary Statistics

Quantiles, Percentiles, and the Five-Number Summary

  • A quantile is a value indicates an observation rank when compared to all the other present observations.
  • For example, the median is itself a quantile. It’s the 0.5th quantile.
  • Alternatively, quantiles can be expressed as a percentile.

The median = the 0.5th quantile = The 50th percentile

Sample : 2 , 4.4 , 3 , 3 , 2 , 2.2 , 2 , 4

2 , 2 , 2 , 2.2 , 3 , 3 , 4, 4.4

0.5th quantile = median = 2.6

R - Elemantary Statistics

Quantiles, Percentiles, and the Five-Number Summary

xdata <- c(2,4.4,3,3,2,2.2,2,4)
quantile(xdata,prob=0.8) # the 0.8th quan- tile (or 80th percentile)
## 80% 
## 3.6
quantile(xdata,prob=c(0,0.25,0.5,0.75,1))
##   0%  25%  50%  75% 100% 
## 2.00 2.00 2.60 3.25 4.40
summary(xdata) # A quartile is a type of quantile.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.600   2.825   3.250   4.400

R - Elemantary Statistics

Quantiles, Percentiles, and the Five-Number Summary

A quartile is a type of quantile.

R - Elemantary Statistics

Quantiles , Percentiles, and the Five-Number Summary

xdata <- c(2,4.4,3,3,2,2.2,2,4)

  • quantile(xdata)
  • summary(xdata)

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

  • How dispersed your data are. For this, measures of spread are needed.
xdata <- c(2,4.4,3,3,2,2.2,2,4)
ydata <- c(1,4.4,1,3,2,2.2,2,7)

mean(xdata)
## [1] 2.825
mean(ydata)
## [1] 2.825

R - Elemantary Statistics

plot(xdata,type="n",xlab="",ylab="data vector",yaxt="n",bty="n")
abline(h=c(3,3.5),lty=2,col="red")
abline(v=2.825,lwd=2,lty=3)
text(c(0.8,0.8),c(3,3.5),labels=c("x","y"))
points(jitter(c(xdata,ydata)),c(rep(3,length(xdata)), rep(3.5,length(ydata))))

the observations in ydata are more “spread out”

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

  • The sample variance measures the degree of the spread of numeric observations around their arithmetic mean.

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

2 , 4.4 , 3 , 3 , 2 , 2.2 , 2 , 4 ( mean = 2.825)

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

  • The standard deviation is simply the square root of the variance. The scale of the original observations.

0.953 represents the average distance of each observation from the mean

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range,

  • Unlike the variance and standard deviation, the interquartile range (IQR) is not computed with respect to the sample mean.

  • IQR is computed as the difference between the upper and lower quartiles of your data

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

xdata <- c(2,4.4,3,3,2,2.2,2,4)

var(xdata)
## [1] 0.9078571
sd(xdata)
## [1] 0.9528154
IQR(xdata)
## [1] 1.25

R - Elemantary Statistics

R - Elemantary Statistics

R - Elemantary Statistics

Spread: Variance, Standard Deviation, and the Interquartile Range

xdata <- c(2,4.4,3,3,2,2.2,2,4)

  • var()
  • sd()
  • IQR()

R - Elemantary Statistics

Covariance and Correlation

  • Investigate the relationship between two numeric variables to assess trends
  • The covariance expresses how much two numeric variables “change together” and the nature of that relationship, whether it is positive or negative.

R - Elemantary Statistics

Covariance and Correlation

x = {x1,x2,…,xn}

y = {y1,y2,…,yn}

for i = 1,. . . ,n

When you get a positive result for rxy, it shows that there is a positive lin- ear relationship. When rxy = 0, this indicates that there is no linear relationship.

R - Elemantary Statistics

Covariance and Correlation

x = {2,4.4,3,3,2,2.2,2,4}

y = {1,4.4,1,3,2,2.2,2,7}

mean x and y = 2.825

positive relationship

R - Elemantary Statistics

Covariance and Correlation

x <- c(2,4.4,3,3,2,2.2,2,4)
y <- c(1,4.4,1,3,2,2.2,2,7)
plot(x,y, col="red", pch=13,cex=1.5 )

R - Elemantary Statistics

Covariance and Correlation

  • Correlation allows you to interpret the covariance further by identifying the strength of any association.

R - Elemantary Statistics

Covariance and Correlation

  • Most common of these is Pearson’s product-moment correlation coefficient. (R default)

  • The correlation coefficient estimates the nature of the linear relationship between two sets of observations

−1 ≤ ρxy ≤ 1

ρxy = 1, which is a perfect positive linear relationship

R - Elemantary Statistics

Covariance and Correlation

x = {2,4.4,3,3,2,2.2,2,4}

y = {1,4.4,1,3,2,2.2,2,7}

(mean x and y = 2.825)

(sx = 0.953 and sy = 2.013)

(rxy = 1.479)

ρxy is positive

R - Elemantary Statistics

Covariance and Correlation

x <- c(2,4.4,3,3,2,2.2,2,4)
y <- c(1,4.4,1,3,2,2.2,2,7)
plot(x,y, col="red", pch=13,cex=1.5)
abline(lm(y ~ x))

R - Elemantary Statistics

Covariance and Correlation

xdata <- c(2,4.4,3,3,2,2.2,2,4)
ydata <- c(1,4.4,1,3,2,2.2,2,7)

cov(xdata,ydata)
## [1] 1.479286
cor(xdata,ydata)
## [1] 0.7713962

R - Elemantary Statistics

Covariance and Correlation

  • cov( , )
  • cor( , )

R - Elemantary Statistics

R - Elemantary Statistics

Basic Data Visualization

Basic Data Visualization

Barplots and Pie Charts

station_data <- read.csv("https://web.itu.edu.tr/~tokerem/18397_Cekmekoy_Omerli_15dk.txt", header=T, sep = ";")

head(station_data)
##   sta_no year month day hour minutes temp precipitation pressure
## 1  18397 2017     7  26   18       0 23.9             0   1003.0
## 2  18397 2017     7  26   18      15 23.9             0   1003.1
## 3  18397 2017     7  26   18      30 23.8             0   1003.2
## 4  18397 2017     7  26   18      45 23.8             0   1003.2
## 5  18397 2017     7  26   19       0 23.6             0   1003.2
## 6  18397 2017     7  26   19      15 23.2             0   1003.1
##   relative_humidity
## 1                94
## 2                95
## 3                96
## 4                96
## 5                96
## 6                97

Basic Data Visualization

Barplots and Pie Charts

head(station_data$temp)
## [1] 23.9 23.9 23.8 23.8 23.6 23.2
barplot(station_data$temp)

Basic Data Visualization

Barplots and Pie Charts

station_data$temp
##   [1] 23.9 23.9 23.8 23.8 23.6 23.2 23.2 23.1 23.0 22.8 22.5 22.4 22.2 22.3
##  [15] 22.2 21.7 21.9 21.7 21.6 22.2 22.2 22.1 22.3 22.5 22.3 22.2 22.5 22.6
##  [29] 22.6 22.6 22.6 22.7 22.6 22.5 22.6 22.5 22.5 22.4 22.5 22.4 22.5 22.6
##  [43] 23.0 23.2 24.2 25.1 25.5 26.1 27.1 26.9 27.6 28.0 28.4 28.5 29.3 30.2
##  [57] 30.1 30.1 30.4 30.4 30.8 30.9 31.0 31.5 31.2 30.9 30.9 30.4 30.4 30.0
##  [71] 29.2 29.5 29.4 29.3 29.6 28.8 29.0 29.0 29.2 28.4 27.8 27.4 26.6 26.2
##  [85] 25.8 25.6 25.4 24.2 19.2 19.5 20.1 20.8 21.2 21.4 21.4 21.4 21.2 21.0
##  [99] 20.8 20.9 20.8 20.7 20.8 20.8 20.9 20.6 20.6 20.5 20.7 20.8 20.4 20.4
## [113] 20.6 20.5 20.4 20.5 20.5 20.6 20.5 20.5 20.4

Basic Data Visualization

Barplots and Pie Charts

table(station_data$temp)
## 
## 19.2 19.5 20.1 20.4 20.5 20.6 20.7 20.8 20.9   21 21.2 21.4 21.6 21.7 21.9 
##    1    1    1    4    6    4    2    6    2    1    2    3    1    2    1 
## 22.1 22.2 22.3 22.4 22.5 22.6 22.7 22.8   23 23.1 23.2 23.6 23.8 23.9 24.2 
##    1    5    3    3    8    7    1    1    2    1    3    1    2    2    2 
## 25.1 25.4 25.5 25.6 25.8 26.1 26.2 26.6 26.9 27.1 27.4 27.6 27.8   28 28.4 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    2 
## 28.5 28.8   29 29.2 29.3 29.4 29.5 29.6   30 30.1 30.2 30.4 30.8 30.9   31 
##    1    1    2    2    2    1    1    1    1    2    1    4    1    3    1 
## 31.2 31.5 
##    1    1
f_temp <- table(station_data$temp)

Basic Data Visualization

Barplots and Pie Charts

barplot(f_temp)

Basic Data Visualization

Barplots and Pie Charts

barplot(f_temp,beside=TRUE,horiz=TRUE,las=1,
        main="Frequency of Station Temperature",
        names.arg=c("T"),legend.text=c("TEMP-f"),
        args.legend=list(x="bottomright"))

Basic Data Visualization

Barplots and Pie Charts

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
qplot(factor(station_data$temp),geom="bar")

Basic Data Visualization

Barplots and Pie Charts

head(station_data$precipitation)

pie(table(station_data$precipitation),labels=c("V1","V2","V3","V4","V5"),col=c("white","blue","green","orange"),main="pie chart for precipitation")

Basic Data Visualization

Histogram

hist(station_data$temp)

Basic Data Visualization

Histogram

hist(station_data$tem,breaks=seq(19,32,1),col="green",main="Temp",xlab="HP")
abline(v=c(mean(station_data$temp),median(station_data$temp)), col=c("blue","red"),lty=c(2,3),lwd=2)
legend("topright",legend=c("mean T","median T"),lty=c(2,3),lwd=2,col=c("blue","red"))

Basic Data Visualization

Histogram

qplot(station_data$temp,geom="blank",main="Temp Hist",xlab="Temp")+ 
  geom_histogram(color="black",fill="white",breaks=seq(19,32,1),closed="right") + 
  geom_vline(mapping=aes(xintercept=c(mean(station_data$tem), median(station_data$tem)), linetype=factor(c("mean","median"))) , col=c("blue","red"),show.legend=TRUE)+ 
  scale_linetype_manual(values=c(2,3)) + 
  labs(linetype="")

Basic Data Visualization

Boxplot

boxplot(station_data$temp)

mean(station_data$temp)
## [1] 24.24132
median(station_data$temp)
## [1] 22.6
quantile(station_data$temp)
##   0%  25%  50%  75% 100% 
## 19.2 21.4 22.6 27.6 31.5
# summary(station_data$temp)

Basic Data Visualization

Histogram and Boxplot

hist(station_data$temp)
abline(v=median(station_data$temp),col="red")

boxplot(station_data$temp,horizontal = T)
abline(v=median(station_data$temp),col="red")

Basic Data Visualization

Boxplot

Basic Data Visualization

Scatter Plots

plot(station_data$temp,station_data$relative_humidity)

Basic Data Visualization

Scatter Plots

head(station_data[,7:10])
##   temp precipitation pressure relative_humidity
## 1 23.9             0   1003.0                94
## 2 23.9             0   1003.1                95
## 3 23.8             0   1003.2                96
## 4 23.8             0   1003.2                96
## 5 23.6             0   1003.2                96
## 6 23.2             0   1003.1                97

Basic Data Visualization

library("GGally")
ggpairs(station_data[,7:10],axisLabels="internal")

R - Probability

R - Probability

A probability is a number that describes the “magnitude of chance” associated with making a particular observation or statement.

It’s always a number between 0 and 1 (inclusive) and is often expressed as a fraction.

X.outcomes <- c(2:12)
X.prob <- c((1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36))
barplot(X.prob,ylim=c(0,0.20),names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X = x)", main = "probability distribution")

X.outcomes <- c(2:12)
X.prob <- c((1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36))
X.cumul <- cumsum(X.prob)
barplot(X.cumul,names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X <= x)", main = "cumulative probability distribution")

X.outcomes <- c(2:12)
X.prob <- c((1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36))
barplot(X.prob,ylim=c(0,0.20),names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X = x)", main = "probability distribution")
abline(v=c(0.5:10.5))

PDF - Probability Density Function

lower < 7 < upper

X >= 2  &  X <= 7
(X[lower] - 1)/36

X > 7 & X <= 12
13 - X[upper])/36

X.outcomes <- c(1,2,3,4,5,6,7,8,9,10,11,12,13)

lower <- X.outcomes >= 2 & X.outcomes <= 7
upper <- X.outcomes > 7 & X.outcomes <= 12

fx <- rep(0,length(X.outcomes))
fx[lower] <- (X.outcomes[lower] - 1)/36
fx[upper] <- (13 - X.outcomes[upper])/36

plot(X.outcomes,fx,type="l",ylab="f(x)", xlim = c(0,14), main = "probability density function")
abline(h=0,col="gray",lty=2)

fx.specific <- (4.5-1)/36

fx.specific.area <- 3.5*fx.specific*0.5

fx.specific.vertices <- rbind(c(1,0),c(4.5,0),c(4.5,fx.specific))

plot(X.outcomes,fx,type="l",ylab="f(x)", xlim = c(0,14), main = "probability density function")
abline(h=0,col="gray",lty=2)
polygon(fx.specific.vertices,col="gray",border=NA)
abline(v=4.5,lty=3)
text(4,0.01,labels=fx.specific.area)

R - Probability - Shape

  • Symmetry : Draw a vertical line down the center, and it is equally reflected with 0.5 probability.

  • Skew : If a distribution is asymmetric, look at the “tail” of a distribution. Positive or right skew indicates a tail extending longer to the right of center.

  • Modality : Modality describes the number of easily identifiable peaks in the distribution of interest. Unimodal, bimodal, and trimodal…

table(station_data$temp)
## 
## 19.2 19.5 20.1 20.4 20.5 20.6 20.7 20.8 20.9   21 21.2 21.4 21.6 21.7 21.9 
##    1    1    1    4    6    4    2    6    2    1    2    3    1    2    1 
## 22.1 22.2 22.3 22.4 22.5 22.6 22.7 22.8   23 23.1 23.2 23.6 23.8 23.9 24.2 
##    1    5    3    3    8    7    1    1    2    1    3    1    2    2    2 
## 25.1 25.4 25.5 25.6 25.8 26.1 26.2 26.6 26.9 27.1 27.4 27.6 27.8   28 28.4 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    2 
## 28.5 28.8   29 29.2 29.3 29.4 29.5 29.6   30 30.1 30.2 30.4 30.8 30.9   31 
##    1    1    2    2    2    1    1    1    1    2    1    4    1    3    1 
## 31.2 31.5 
##    1    1
df_temp_table <- data.frame(table(station_data$temp))
df_temp_table
##    Var1 Freq
## 1  19.2    1
## 2  19.5    1
## 3  20.1    1
## 4  20.4    4
## 5  20.5    6
## 6  20.6    4
## 7  20.7    2
## 8  20.8    6
## 9  20.9    2
## 10   21    1
## 11 21.2    2
## 12 21.4    3
## 13 21.6    1
## 14 21.7    2
## 15 21.9    1
## 16 22.1    1
## 17 22.2    5
## 18 22.3    3
## 19 22.4    3
## 20 22.5    8
## 21 22.6    7
## 22 22.7    1
## 23 22.8    1
## 24   23    2
## 25 23.1    1
## 26 23.2    3
## 27 23.6    1
## 28 23.8    2
## 29 23.9    2
## 30 24.2    2
## 31 25.1    1
## 32 25.4    1
## 33 25.5    1
## 34 25.6    1
## 35 25.8    1
## 36 26.1    1
## 37 26.2    1
## 38 26.6    1
## 39 26.9    1
## 40 27.1    1
## 41 27.4    1
## 42 27.6    1
## 43 27.8    1
## 44   28    1
## 45 28.4    2
## 46 28.5    1
## 47 28.8    1
## 48   29    2
## 49 29.2    2
## 50 29.3    2
## 51 29.4    1
## 52 29.5    1
## 53 29.6    1
## 54   30    1
## 55 30.1    2
## 56 30.2    1
## 57 30.4    4
## 58 30.8    1
## 59 30.9    3
## 60   31    1
## 61 31.2    1
## 62 31.5    1

barplot(df_temp_table$Freq/121,names.arg=df_temp_table$Var1)

R - Common Probability Distributions

R - Common Probability Mass Functions

For discrete random variables

  • Bernoulli Distribution : has only two possible outcomes, such as success or failure.

Bernoulli Distribution

x<-1
p <- 0.6

b_fx <- p^x*((1-p)^(1-x))

barplot(c(1-p,p),names.arg=c(0,1))

R - Common Probability Mass Functions

For discrete random variables

  • Binomial Distribution : The binomial distribution is the distribution of successes in n number of trials involving binary discrete random variables.

Binomial Distribution

There are four functions associated with Binomial distributions.

  • dbinom(x, size, prob)
  • pbinom(x, size, prob)
  • qbinom(p, size, prob)
  • rbinom(n, size, prob)
  • x is a vector of numbers.
  • p is a vector of probabilities.
  • n is number of observations.
  • size is the number of trials.
  • prob is the probability of success of each trial.

Binomial Distribution - dbinom

It is a density or distribution function.

x <- 1
size <- 8
prob <- 1/2
dbinom(x , size , prob)
## [1] 0.03125
x <- 4
dbinom(x , size , prob)
## [1] 0.2734375
x <- 0:8
dbinom(x , size , prob)
## [1] 0.00390625 0.03125000 0.10937500 0.21875000 0.27343750 0.21875000
## [7] 0.10937500 0.03125000 0.00390625

Binomial Distribution - dbinom

bin <- dbinom(x = 0:8 , size = 8 , prob = 0.5)
plot(x=0:8, y = bin)

Binomial Distribution - dbinom

X.outcomes <- c(1:13)
X.prob <- c((0/36),(1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36),(0/36))
barplot(X.prob,ylim=c(0,0.20),names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X = x)", main = "probability distribution")

Binomial Distribution - dbinom

X.outcomes <- c(1:13)
X.prob <- c((0/36),(1/36),(2/36),(3/36),(4/36),(5/36),(6/36),(5/36),(4/36),(3/36),(2/36),(1/36),(0/36))
barplot(X.prob,ylim=c(0,0.20),names.arg=X.outcomes,space=0,xlab="x",ylab="Pr(X = x)", main = "probability distribution")

lines(dbinom(x = 0:12, size = 36, prob = 1/6), col= "red")

R - Common Probability Mass Functions

  • Poisson Distribution : important and rarely seen event.

λp should be interpreted as the “mean number of occurrences”

Poisson Distribution

There are three functions associated with Binomial distributions.

  • dpois(x, lambda)
  • ppois(q, lambda, lower.tail)
  • qpois(p, lambda, lower.tail)
  • rpois(n, lambda)
  • x : successes in a period
  • λ : the expected number of events
  • lower.tail = TRUE for left tail
  • q vector of quantiles
  • n number of random values to return
  • p vector of probabilities

Poisson Distribution

plot(dpois(0:10,2.22),type = "o", col="red")
lines(dpois(0:10,4.22), type = "o", col = "blue")
lines(dpois(0:10,7.22), type = "o", col = "green")

Next Week

  • R - Common Probability Density Functions

    • Uniform
    • Normal
    • Student’s t-distribution
    • Exponential
    • (gamma, beta, log-normal)
  • Practice : Write A Function

  • Practice : Netcdf Packages

  • QUIZ